13 research outputs found

    ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

    Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have been a promising tool for protein sequence design. In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while the other residues are maintained. Unfortunately, because of their left-to-right nature, existing pLMs modify suffix residues by prompting with prefix residues, which is insufficient for the infilling task, where the whole surrounding context must be considered. To find more effective pLMs for protein engineering, we design a new benchmark, Secondary structureE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on the benchmark, we reveal the weakness of existing language models and show that language models trained via fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also show, through exhaustive experiments and visualizations, that ProtFIM generates protein sequences with decent protein representations. Comment: Preprint
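
    The fill-in-middle transformation mentioned in this abstract can be illustrated with a minimal sketch. The sentinel tokens `<PRE>`, `<SUF>`, and `<MID>`, the function names, and the example sequence are assumptions made for illustration; the abstract does not specify the tokens or span-sampling scheme that ProtFIM actually uses. The idea is that the middle span is moved to the end, so a left-to-right model sees both the prefix and the suffix before generating the middle.

```python
# Hypothetical sentinel tokens; the abstract does not give ProtFIM's real vocabulary.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(seq: str, start: int, end: int) -> str:
    """Move the middle span seq[start:end] to the end of the string.

    A causal model trained on the result conditions on both the prefix
    and the suffix when it generates the middle residues.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("span out of range")
    prefix, middle, suffix = seq[:start], seq[start:end], seq[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_restore(transformed: str) -> str:
    """Invert fim_transform and return the sequence in its natural order."""
    body = transformed[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


example = fim_transform("MKTAYIAK", 2, 5)  # toy sequence, not from the paper
restored = fim_restore(example)
```

    In this toy case the span "TAY" is placed after the `<MID>` token, and `fim_restore` recovers the original order of residues.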

    Solvent: A Framework for Protein Folding

    Consistency and reliability are crucial for conducting AI research. In many well-known research fields, such as object detection, methods have been compared and validated with solid benchmark frameworks. After AlphaFold2, the protein folding task has entered a new phase, and many methods have been proposed based on components of AlphaFold2. A unified research framework for protein folding, with implementations and benchmarks, is important for comparing various approaches consistently and fairly. To achieve this, we present Solvent, a protein folding framework that supports significant components of state-of-the-art models through an off-the-shelf interface. Solvent contains different models implemented in a unified codebase and supports training and evaluation for defined models on the same dataset. We benchmark well-known algorithms and their components and provide experiments that give helpful insights into the protein structure modeling field. We hope that Solvent will increase the reliability and consistency of proposed models and improve efficiency in both speed and cost, thereby accelerating research on protein folding modeling. The code is available at https://github.com/kakaobrain/solvent, and the project will continue to be developed. Comment: preprint, 8 pages

    A community-powered search of machine learning strategy space to find NMR property prediction models

    The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published "in-house" efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
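
    The meta-ensemble described here, a linear combination of top predictions, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the fixed weights, the toy numbers, and the plain mean absolute error are all hypothetical, and the abstract does not say how the actual weights were fitted or which error metric was used.

```python
def linear_ensemble(predictions, weights):
    """Combine per-model prediction lists into one list by a weighted sum."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same number of targets")
    return [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n)]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (not from the paper): two models whose errors have opposite signs.
truth = [1.0, 2.0, 3.0]
model_a = [1.5, 2.5, 3.5]
model_b = [0.5, 1.5, 2.5]

blended = linear_ensemble([model_a, model_b], [0.5, 0.5])
error_a = mean_absolute_error(truth, model_a)
error_blend = mean_absolute_error(truth, blended)
```

    Because the two toy models err in opposite directions, the blend has a lower error than either model alone. This is the general reason a linear combination can outperform each of its components.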

    Deep learning models for predicting RNA degradation via dual crowdsourcing

    Medicines based on messenger RNA (mRNA) hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition (‘Stanford OpenVaccine’) on Kaggle, involving single-nucleotide resolution measurements on 6,043 diverse 102–130-nucleotide RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504–1,588 nucleotides) with improved accuracy compared with previously published models. These results indicate that such models can represent in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for dataset creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.

    Deep learning models for predicting RNA degradation via dual crowdsourcing

    Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 diverse 102-130-nucleotide RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.

    Reduction of diabetic atherosclerosis by RELM-α in an Ldlr-knockout mouse model (Ldlr 유전자가 제거된 마우스 모델에서 RELM-α 의한 당뇨성 동맥경화 감소효과)

    Thesis (Master's) -- Seoul National University Graduate School: Department of Veterinary Medicine, February 2017. 이항.
    Resistin-like molecule (RELM)-α belongs to a family of secreted mammalian proteins that have putative immunomodulatory functions. Recent studies have identified a role of RELM-α in the pathogenesis of hyperlipidemia-induced atherosclerosis. However, whether RELM-α regulates diabetic atherosclerosis is unknown. Here we report that RELM-α has anti-atherogenic effects and protects against diabetic atherosclerosis in low-density lipoprotein receptor-deficient (LDLR -/-) mice. Severity of the induced diabetic state was confirmed by monitoring blood glucose levels and body weight. RELM-α overexpression appears to have a cholesterol-lowering effect. In particular, there was a significant difference in the cholesterol levels of the diabetic group. After 8 weeks on a high-fat diet (HFD), total en face aortic lesion area was reduced in RELM-α overexpressing (RELM-α Tg) mice compared with control mice in both the non-diabetic and diabetic groups. Plaque area in the aortic arch was also decreased in RELM-α Tg mice of both groups. We show that RELM-α overexpression has a stronger anti-atherogenic effect, accompanied by a decrease in cholesterol, in the diabetic group than in the non-diabetic group. These findings define RELM-α as a novel therapeutic target for treating diabetic atherosclerosis.
    Contents: Introduction. Materials and Methods: 1. Animal Studies and Diet; 2. Genotyping; 3. Antibodies; 4. Immunoblotting; 5. Streptozotocin Induced Diabetic Model and Mice Monitoring; 6. Blood Analysis; 7. Assessment of Atherosclerosis; 8. Statistical Analysis. Results: 1. The mice model of RELM-α overexpression; 2. RELM-α overexpression reduces cholesterol in diabetic atherosclerosis mice; 3. RELM-α overexpression reduces aortic arch plaque size; 4. RELM-α overexpression decreases aortic root plaque size. List of Table. List of Figure. Discussion. References. Abstract in Korean.
    Master